In this report, Machine Learning techniques were applied to investigate Fraud at Enron Corporation which was a famous worldwide known accounting scandal that led the company to bankruptcy in 2001 and that was the largest corporate bankruptcy in U.S. and the most complex white-collar crime investigation in the FBI’s history at that time.
For this study, a dataset provided during the Udacity's course was used containing information from several people related to the Enron corporation where each person was labelled as a "Person Of Interest" or as a "Non-Person Of Interest" depending on the indictments and the posterior investigations conducted by the Police and the U.S. authorities.
In summary, Enron's leadership tricked regulators and authorities with fake holdings and off-the-books accounting practices during several years using fake or special companies and vehicles to hide its mountains of debt and toxic assets from investors and creditors and inflating their incomes under the umbrella of a manipulated Mark-to-Market accounting method.
This study aims to determine whether it can be identified that a certain person was a "Person of Interest" based on some finantial and specific information from such person. For that, Machine Learning techniques can be very useful to determine which commonalities are present in the available data for each class of people and to build unsupervised and/or supervised algorithms able to capture the features that contain the most useful information and to classify each person in the appropiate group. Ideally, as a possible application, machine learning algorithms like this study could be used by the police and investigators to help in their investigations focusing first the efforts on that people classified as a "Person of Interest" just by analyzing a certain set of data from them.
A lot of information about the Enron's case can be found in internet but some of the sources that were consulted in this study were:
Apart from these sources related to the Enron's scandal, other sources were also consulted along this report to make the required analyses. The main ones were:
The main objective of this study was to develop a classifier using Machine Learning techniques that, given a certain set of features for several people related to the Enron company, is able to determine whether the person to whom the data belongs to was a Person of Interest (POI) or not. For such objective, the following partial objectives/questions were answered:
The following python packages and functions were used along this report:
### Load python Functions
%load_ext autoreload
%autoreload 1
%aimport tester
%aimport poi_id
%aimport Data_Cleaning_Functions
### Import all packages that will be later on needed
import sys
import pickle
import pprint
import pandas as pd
import numpy as np
import math
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
# To export graphs to pdf
%matplotlib inline
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('png', 'pdf')
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2
from sklearn.calibration import calibration_curve
from sklearn.metrics import classification_report
from scipy import stats
from textwrap import wrap
### Import additional functions
sys.path.append("data/")
sys.path.append("tools/")
%aimport feature_format
### shows plots directly in the notebook or to export to pdf
%matplotlib notebook
### Generates a watermark with the version of python used to build this analysis
%load_ext watermark
%watermark
For this study, it was used the data about Enron company provided during the Udacity's Data Analyst NanoDegree course. Such data contained the following information:
The second file was actually included in the first pickle file as the labels of the dataset gathered under the "poi" key/column. The rest of information contained in the pickle file can be used to determine whether there is any relation between such data and the "poi" key/column that would allow to identify the "poi" value by using any of the other information, this is, to build a classifier using Machine Learning algorithms that, given a certain finantial or specific information about a certain person, the algorithm is able to determine if such person is a "poi" or not.
Note that the "email_address" information was discarded from this study as it is a personal and subjective data, different for each person and that at first has no relation with a "poi" classification.
The cleaning process was mainly focused on selecting the most interesting data and features for our analysis according to the objectives defined before, identifying the possible presence of outliers and NaN values that could affect to the conclusions.
In order to have a first look at the data, a scatter matrix was plotted for having a first feeling of the shape of the data and possible correlations between the different features as well as the issues we may found (i.e. outliers). Note that the "email_address" information was not used as it was not considered a feature.
### Define all features of Enron's people
features_list = ['poi','bonus','deferral_payments','deferred_income','director_fees',
'exercised_stock_options','expenses','from_messages','from_poi_to_this_person',
'from_this_person_to_poi','loan_advances','long_term_incentive','other',
'restricted_stock','restricted_stock_deferred','salary','shared_receipt_with_poi',
'to_messages','total_payments','total_stock_value']
### Load the corresponding Data with no modification
data = Data_Cleaning_Functions.loadData(features_list,total_removal=False,nanFlag=False)
### Represent a scatter matrix for the selected features
Data_Cleaning_Functions.plot_scatter_matrix(data,"maxmin",features_list)
The first thing that gives the attention on this figure is that in most of the scatter plots there seem to be a big difference between some few cases and the majority of the data. The reason of such behaviour was the presence of a "Total" sum entry, as it was found during the Nanodegree course. Therefore, such "TOTAL" entry was disregarded and the scatter matrix was generated again with the rest of data:
### Load the corresponding Data removing the TOTAL entry
data = Data_Cleaning_Functions.loadData(features_list,total_removal=True,nanFlag=False)
### Represent a scatter matrix for the selected features
Data_Cleaning_Functions.plot_scatter_matrix(data,"maxmin",features_list)
In this new figure, the ranges of the data look more reasonable but it is still observed a big difference between the number of points available for each feature, thus indicating the presence of NaN values.
Taking advantadge of "missingno" package, the presence of NaN values was represented in the following figure:
### Represent the amount of NaN values present for each feature
Data_Cleaning_Functions.show_NaN(data,features_list,nplots=2)
In this figure it is clearly seen that the amount of "NaN" values for some of the features (i.e. loan_advances) is huge and so, the information provided by such features will be very limited or even inconclusive. Therefore, it was decided to disregard those features which a ratio of "NaN" values higher than 70%. These discarded features were:
Checking again the remaining data, several "NaN" values were still present in all columns (except "poi") that were managed case by case depending on the needs. It was also observed that some rows contained very few data and there was even one case with no data at all (except from "poi" column), which was also removed.
### Define features of Enron's people to be kept
features_list = ['poi','bonus','deferred_income','exercised_stock_options','expenses','from_messages',
'from_poi_to_this_person','from_this_person_to_poi','long_term_incentive','other',
'restricted_stock','salary','shared_receipt_with_poi','to_messages','total_payments','total_stock_value']
### Load the corresponding Data for the selected features
data = Data_Cleaning_Functions.loadData(features_list,total_removal=True,nanFlag=False)
### Represent a scatter matrix for the selected features
Data_Cleaning_Functions.plot_scatter_matrix(data,"maxmin",features_list)
### Represent the amount of NaN values present for each feature>
Data_Cleaning_Functions.show_NaN(data,features_list,nplots=1)
In terms of the distributions of the data, the ranges of the data looked more reasonable now but there still were some points that significantly differed from the others which could be an indication of the presence of additional outliers.
Apart from the outliers, in the scatter matrix before it was also observed that there were other features that seem to be significantly correlated to each other, thus indicating that the dimensionality of the study most likely could be reduced without a significant loss of information.
In order to determine which features are the most suitable for our analysis without a significant loss of information, an initial Principal Component Analysis study was conducted to make a first selection of the most interesting features for our analysis. For that, the remaining outliers and NaN values were managed as follows:
In terms of outliers, at this initial stage of the analysis, it was decided to use a common criteria based on the z-score of the sample values and the probability for such value to happen. In this case, it was considered as an outlier those values whose z-score was higher than 3 times the standard deviation (less than 0.13% probability).
Note that using this kind of criteria some assumptions are being made in terms of the normality of the data considering that the distribution of each feature follow a Gaussian distribution. Looking at the histograms of the scatter matrix above it was observed that such assumption may be correct for some of the features but for some others maybe this assumption was not the most suitable one. Nevertheless, at this stage and according to the Central Limit Theorem, the assumption of Normality was deemed suitable enough for this first filtering.
In terms of NaN values, three different options were considered:
In addition, those rows where all features contain NaN values were also discarded.
Before performing the PCA study, all features were also scaled to avoid undesired effects due to the differences between the features scales. In this case, a Scaling based on Maximum and Minimum values of the features was done once the outliers were removed from the sample data.
Note a Standard Scaler could have been also used here according to the normality assumption mentioned before. However, in this case it was considered enough to use a scaler based on the range without outliers.
Once the dataset was clean and armonized, the PCA study was conducted to determine the number of components that contained most of the variance of the data. In this case, a 95% explained variance ratio was selected as a criteria to determine such number of components.
### Load the corresponding Data for the selected features and removing the TOTAL entry and all blank rows
data = Data_Cleaning_Functions.loadData(features_list,total_removal=False,nanFlag=False,zerosFlag=True)
### Perform a PCA study to determine the number of components that contain most of the variance of the data
normdata,labels,features,pca,vardf, nbest = Data_Cleaning_Functions.components_selection(data,features_list,True,"maxmin",True,"Zscore",3,True,"variance",0.94)
According to this graph, the PCA study determined that there were 12 main components between the initial 15 selected features that contained at least 95% of the variance of the data.
Note: Due to the replacement of the NaN values by a randomly generated sample of data (aiming to keep constant the variance of the features), the results of the PCA study may vary from 12 to 13 components containing the 95% of the variance because sometimes with 12 components the ratio drops to around 94% and 13 are returned as a solution. Nevertheless, 12 components were deemed a good balance for the following analyses (thus 94% threshold was used to assure the algorithm always considers 12 components).
The obtained components were represented in a heatmap to also provide a visualization of the contribution of each feature on the different components:
### Shows a Heatmap with the obtained components
Data_Cleaning_Functions.components_heatmap(vardf)
In order to select the 12 best components identified before, a "SelectKBest" method was used together with a "chi2" scoring method to determine which of the features seems more independent from the class and so, let's say more irrelevant for the classification.
### Select the best features and transform the data into it
clean_data,new_features,features_selected = Data_Cleaning_Functions.best_features_selection('kbest', features_list[1:],normdata,
features,labels,nbest,showPlot=True)
### Show the final Selected features
print("The final features selected for the analysis were:")
pprint.pprint(features_selected)
Note that, despite 12 features were selected, according to the scores and p-values observed in the previous graphs, there are 2 features "exercised_stock_options" and "total_stock_value" that seem to be the features providing the best scores while some other resulted in very low values. Such behaviour could be due to a correlation between those two features so it will have to be taken into account during the posterior analysis. Note that such two features would have been selected if a SelectPercentile method would have been used instead as follows:
### Select the best features and transform the data into it
opt_data,opt_features,features_optimum = Data_Cleaning_Functions.best_features_selection('percentile', features_list[1:],
normdata,features,labels,nbest=10,showPlot=False)
### Show the features giving the highest scores
print("The features that provide the highest scores are:")
pprint.pprint(features_optimum)
The resultant cleaned dataset with the selected 12 features was again represented in a scatter_matrix for a final visually check and was saved as pickle file in a dictionary format (as the original dataset) as an input for the Exploratory Data Analysis section.
### Represent a scatter matrix for the selected features
Data_Cleaning_Functions.plot_scatter_matrix(clean_data,"maxmin",clean_data.columns)
### Save the cleaned data set into a pickle file
Data_Cleaning_Functions.save_cleaned_data(clean_data,"final_project_dataset_CLEANED.pkl")
Summary of Data Cleaning Process:
After having loaded and cleaned the dataset, Objectives described at the beggining of this report were still valid as the main aim of this study was to find a "poi" classifier with the available data, adapting the algorithm properly to each need.
In the Data Cleaning section, the original dataset for "poi" identification contained 19 features (apart from "email_address" field) and was reduced to the 12 best features that contained at least the 95% (or 94%) explained variance ratio of the data.
First, a new scatter matrix was represented again but in this case differentiating with a different color (Orange) the data points corresponding to a "poi" to try to visually identify whether a certain kind of classifier may be more appropiate for our analysis for example in case of a certain pattern is identified.
### Represent a scatter matrix for the selected features differentiating pois
poi_id.plot_scatter_matrix(clean_data,'poi')
In these plots, no clear trend in terms of distribution of the orange points (POI) with regards to the blue points (Non_POI) was observed for any of the features, thus apparently there was no "magic" feature(s) allowing to easily identify when a certain data point corresponds to a "poi" or not.
Therefore, a second PCA method was decided to be used but in this case in combination with a certain classifier in order to identify those really principal components that provide the best scoring accuracy for the selected classifier method and configuration.
However, in the scatter matrix above it was observed that there seems to be a high correlation between the "exercised_stock_options" and the "total_stock_value" features which is not good for machine learning algorithms, thus a previous PCA was also conducted only for those features to reduce them to just one component.
In terms of classifier, no clear clue from the scatter matrix could be obtained to decide at this point which classifier could be the best option (maybe a Decision Tree classifier could work well in this case to take into account the non-linearity of the problem-or maybe an ensemble method could be also appropiate as the size of the sample is not so high), thus several classifiers were tested to select the one giving the best results.
The complete study could be summarized as follows:
### TASK1: Loads the corresponding data from the stored pickle file
# Identifies the list of features to be used after the cleaning process
features_list = ['poi']
[features_list.append(x) for x in features_selected]
# Loads data
data = poi_id.task1_select_features("data/final_project_dataset_CLEANED.pkl",features_list)
### TASK2: Remove outliers using LOF algorithm
outdata = poi_id.task2_remove_outliers(data,50,0.1,0.9,True)
### TASK3: Creates new features by combining correlated features and scaling data
normdata, features_list, features, labels, my_dataset = poi_id.task3_tune_features(outdata,features_list,corrFlag=True,scaleFlag=True)
### TASK4a: Conducts a GridSearch over all candidates and possible configurations
clfs, features_train, labels_train, features_test, labels_test = poi_id.task4_classifiers_search(normdata,
features_list,features,
labels,'accuracy')
### TASK4b: Compares performance results for each type of classifier
poi_id.plot_classifiers_performance(clfs)
### TASK4c: Compares calibration results for each type of classifier
classReport = poi_id.task4_calibration_check(clfs, features_train, labels_train, features_test, labels_test)
### TASK5: Select best classifier option
clf = poi_id.task5_select_classifier(classReport, clfs, features_train, labels_train, features_test, labels_test,None)
### TASK6: Exports classifier, dataset and features list
poi_id.task6_dump_results(clf, my_dataset, features_list)
The main findings of this analysis can be summarized as follows (but note that results may vary a little if the Data Cleaning section is run several times as explained in the Analysis Limitations section):
Note: The GradientBoosting classifier was also identified as one of the best classifiers together with the ExtraTree classifier according to the selected criteria, thus the results using the best estimator for such classifier (using as a criteria the mean squared error, a minimum samples split of 5 and 50 estimators) were also obtained here below:
### Show results using the best estimator with GradientBoosting Classifier
poi_id.task5_select_classifier(classReport, clfs, features_train, labels_train, features_test, labels_test,'GB')
1. Summarize the goal of the project and how machine learning is useful in trying to accomplish it. Give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?
2. What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.
3. What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?
In this report, it was decided to conduct a dimensionality reduction step by means of a Principal Components Analysis followed by a non-ordered supervised Classification to try to identify on a first step those components containing the most useful information for the subsequent classification, iterating this process over all possible number of components and over certain parameters of the classifiers and making cross-validation to minimize the risk of biassed results. A total of 10 different classifiers were compared to each other in terms of accuracy score, time lapse for fitting the model, time lapse for scoring the model and in terms of the precision, recall and f-1 scores obtained for each of the classes during the predictions.
4. What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).
"Tuning the parameters of an algorithm" can be expressed like searching the best combination of parameters/settings of a certain model/classifier that provides the optimum configuration according to a certain predefined criteria, normally defined by a certain scoring method. Such step is normally an important step in Machine Learning algorithms because normally classifiers have several parameters that can be tuned and the performance and results of the algorithm can be completely different depending on the selected combination. For example, the plots provided in this report in "TASK4b)" clearly show how different the results of the classifiers can be dependending on the selected configuration (combination of parametes) of the classifier.
At the beggining of this section it was summarized the different parameters that were tuned for each of the 10 selected classifiers, normally searching for the best combination of at least 2 or 3 parameters, if possible. This fine-tuning process of the classifiers was also combined with a tuning process of the number of principal components feeding the classifier, thus trying to optimize not only the classifier outputs but also its inputs.
5. What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?
Within the scope of Machine Learning algorithms, validation is understood as the process of testing the developed model/algorithm with a dataset that has not been used for training or fitting the model/classifier, thus like simulating a real use of the model/classifier with new data. This process is also very important in Machine Learning algorithms because models can respond very well to the data they were trained for, but they can be very bad in a real situation when new data is used and a prediction/estimation is wanted. This is a classic mistake that one can make when developing a new model which is commonly known as "overfitting" which can be described as the tendency of a model to provide very good results for the dataset that was used to train the model, but it does not work properly when predictions are done on new data points.
One example of this could be the results obtained above with the AdaBoost classifier method that the accuracy obtained during training was most of the times 1 but then when testing a new dataset it dropped until 0.8 for a certain compination of parameters. This is an indication that the model can be overfitted and/or that the tuning of the parameters of the classifier was not good enough.
A common practice to validate the results of a certain model/classifier is to split the original dataset in two datasets: one for training the model and one for validation. In this study, 30% of the original dataset was reserve for validation before doing any fine-tuning of parameters and before training any model. Moreover, such split was done using a stratified method to avoid getting a biassed split (thus it is assure all clases are present in all splits).
Once the final classifier was selected, predictions were performed over this 30% reserved dataset to assess and validate the behaviour of the developed classifier.
6. Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.
Several evaluation metrics were used along this report for several purposes (chi2, z-score, accuracy, precision, recall,...). In particular for the final selected classifier, 3 scoring metrics were calculated over the predictions made during the final validation:
Once the complete analysis was conducted, the "tester.py" function provided by Udacity was used to test the obtained classifier using StratifiedShuffleSplit algorithm over 1000 iterations, calculating on each iteration the number of True/False Positives/Negatives to obtain a more realistic estimation of the metrics of the classifier. Note this is one of the requirements of the Rubrik of the project that the final classifier shall provide a precission ans recall scores of at least 0.3.
In this final validation step using "tester.py" function, the precission obtained with the selected Pipeline classifier PCA(n_components=5, random_state=42) - ExtraTreeClassifier(min_samples_split=10, random_state=42) was only precision=0.11765 and recall=0.03, very far from the values obtained during the previous analysis using only stratified split technique to divide the data set.
After revision of the algorithm and the analysis process, the reason of such discrepancies on the final results was associated to what was guessed when looking at the calibration curves of the obtained classifiers and the limited size of the sample of data, thus a StratifiedShuffleSplit algorithm would have been a better option to search the best estimator over a large number of testing samples. However, as the algorithm was built allowing to vary some of the options selected on the different steps, a Manual fine-tuning search was done to try to improve the final metrics of the classifier (so that, complying with the project's rubrik). Some of the things that were tried are summarized as follows:
Several attempts were conducted selecting manually different options getting precisions up to more than 0.6 but the recall values kept always around 0.2 at the maximum. At this point it was decided to run a loop search over some of the options that were manually selected during the analysis at the same time that the "tester.py" function is executed to validate with a StratifiedShuffleSplit strategy the different classifiers. The different options that were varied were:
The exit of the loop was set for the first classifier that, when using the "tester.py" function, provided a precision and recall values above 0.3. After around 127 iterations, a classifier was found meeting the requirements:
Selected Features:'poi', 'bonus', 'deferred_income', 'exercised_stock_options', 'expenses', 'from_messages', 'from_poi_to_this_person', 'long_term_incentive', 'restricted_stock', 'salary', 'shared_receipt_with_poi', 'total_stock_value'
Note: "total_payments" feature was discarded
Best Pipeline Classifier: ('PCA', PCA(n_components=11, random_state=42)) - ('SVM', SVC(C=100, degree=2, gamma=1, random_state=42))
The main limitation of the analysis conducted in this report was related to the limited number of samples (together with a considerable number of features) and the big ammount of NaN values present.
During the Data Cleaning process, NaN values were replaced by randomly generated samples of numbers trying to preserve the variance of the original features. Such decision had the advantage of preserving quite well the variance of the original data for the conducted PCA study but it had the disadvantage that, in the available dataset, several features had a similar weight in terms of scoring and/or p-values so, when trying to select the best features, different results were obtained when running the study several times (depending on the randomly generated samples).
In general, results varied as follows:
Such variation also affected to the posterior selection of classifiers, thus different final classifiers were obtained when running the study several times. Although this was not ideal, it was deemed acceptable as the variations on the selected features were due to similar contributions of some of the features, thus the impact of selecting one or another was not that important and, in addition, the study process was always built over a pipeline made by a first PCA step and a second CLASSIFIER step, thus the best number of principal components was always tuned during the gridsearch.
Note that, if the reader wants to get always the same classification results, the "Data Wrangling" section shall be executed only once, thus the same features will be always used in the "Data Exploration" section (as the data will be always loaded from a stored pickle file).
After the initial Data Cleaning process, 12 main features (containing finantial and specific information of people related to the Enron compony) were selected for developing a Person of Interest (POI) identifier based on Machine Learning algorithms.
During the main analysis, the following conclusions were obtained:
When testing the obtained classifier with a StratifiedShuffleSplit strategy using "tester.py" function, the obtained final precision was only 0.11765 and the recall 0.03, far from the requested 0.3 value.